verkko-fillet : post verkko graph and assembly cleaning in Python¶
verkko-fillet is an easy-to-use toolkit for cleaning Verkko assemblies. It includes tools for generating the Verkko-Fillet object, performing assembly quality checks, identifying gaps, assigning chromosomes, searching for ONT reads to help resolve gaps, filling gaps, and generating the final Rukki path (in a GAF-like format) for future Verkko CNS runs.
This Python-based implementation streamlines the entire process, starting right after the Verkko assembly is completed and preparing for the CNS run.
%load_ext autoreload
%autoreload 2
import sys
import importlib
import pandas as pd
import time
import os
from IPython.display import Image, display
pd.set_option('mode.chained_assignment', None)
from itables import init_notebook_mode
init_notebook_mode(all_interactive=True)
import warnings
import session_info
# Suppress FutureWarnings
warnings.simplefilter(action='ignore', category=FutureWarning)
sys.path.append("/vf/users/Phillippy/projects/giraffeT2T/assembly/script/post_verkko_pkg/src/")
# import verkkoFillet as vf
import verkkoFillet as vf
importlib.reload(vf)
<module 'verkkoFillet' from '/vf/users/Phillippy/projects/giraffeT2T/assembly/script/post_verkko_pkg/src/verkkoFillet/__init__.py'>
The verkkoFillet module, abbreviated as vf, consists of three sub-modules: pp, tl, and pl.
vf.tl: Provides tools for running shell scripts.vf.pp: Includes modules for preprocessing and modifying datasets.vf.pl: Dedicated to plotting and visualization.
Loading verkko directory and building the verkko-fillet object¶
The input file is very simple. You only need the Verkko output directory, as it is well-structured.
verkkoDir="/data/Phillippy/projects/giraffeT2T/assembly/verkko2.2_hifi-duplex_trio-hic/verkko-thic_legacy/"
os.chdir(verkkoDir)
obj = vf.pp.read_Verkko(verkkoDir)
Loading paths.tsv file... Loading Ont alignment GAF file...
obj
verkkoDir: /data/Phillippy/projects/giraffeT2T/assembly/verkko2.2_hifi-duplex_trio-hic/verkko-thic_legacy/ paths: name, path, assignment gaf: Qname, path, mapQ, identity, path_modi
image_path = "/vf/users/Phillippy/projects/giraffeT2T/assembly/script/post_verkko_pkg/data/test_giraffe/fig/giraffe_complete_verkko-verkko_filletObj.png"
display(Image(filename=image_path,width=700))
The Verkko Fillet object will read all files in the Verkko output directory, including the final Rukki path file, ONT alignment, and others, if they exist. However, don't worry if your Verkko directory does not include these files. In this Jupyter notebook, we will generate all the necessary files and statistics and load them into the Verkko Fillet object later. Once you build your own Verkko Fillet object, you can save it locally as a Python pickle file, allowing you to load it anytime for future curation tasks.
You can use the Verkko Fillet Python module to view and edit the object using specific functions. Additionally, you can easily access each piece of data stored in the object through its attributes (e.g., obj.verkkoDir).
When you print the object, it will display the types of data stored within it.
Getting the t2t stats¶
Before starting curation and polishing the assembly, it is helpful to examine the statistics and quality of the initial assembly.
Firstly, run the vf.pp.getT2T function to retrieve the T2T statistics, including the number of scaffolds, contigs, telomeres, and gaps. This function generates four files in the Verkko output directory using the assembly.fasta file:
assembly.telomere.bedassembly.gaps.bedassembly.t2t_scfsassembly.t2t_ctgs
%%time
vf.tl.getT2T(obj)
getT2T was done! CPU times: user 1.07 ms, sys: 211 ms, total: 212 ms Wall time: 6.16 s
Collapsing rDNA nodes.¶
As default, vf.tl.rmrDNA function uses homopolymer-compressed rDNA morph to screen and collapse in the Verkko graph, along with the assembly.telomere.bed file generated above using vf.tl.getT2T(obj). This process marks telomere nodes at the end of the contigs and generates three output files:
target.screennodes.out: This file includes the nodes that have rDNA sequences.assembly.homopolymer-compressed.noseq.telo_rdna.gfa: The graph with the rDNA collapsed and telomere nodes added.assembly.colors.telo_rdna.csv: Adds telomere nodes marked in green (#008000) and rDNA nodes marked in purple (#A020F0), in addition to the originalassembly.colors.csv.
vf.tl.rmrDNA(obj)
Starting removing rDNA nodes in the graph remove rDNA was done! Output files: target.screennodes.out assembly.homopolymer-compressed.noseq.telo_rdna.gfa assembly.colors.telo_rdna.csv
QV calculation¶
QV (Quality Value) is fundamental metric for assessing the quality of an assembly.
The input for QV calculation is a list of high-quality sequencing data, such as HiFi reads, used for the Verkko assembly. K-mers are extracted separately from the sequencing data and the reference assembly. The number of shared k-mers between the two is then used to calculate the QV.
- A high QV indicates a high-quality assembly, where most k-mers are shared with the original high-quality sequencing reads.
- A low QV suggests that the assembly contains many errors not present in the high-quality sequencing data.
fofn = "/data/Phillippy/projects/giraffeT2T/assembly/verkko2.2_hifi-duplex_trio-hic/verkko-thic_v0.1.1/10-evaluation/kmers/fofn/child.fofn"
the kmer databases for each data and assembly will generatedd by meryl
%%time
vf.tl.makeMeryl(obj, fofn = fofn, mergury_dir= "/gpfs/gsfs11/users/kimj75/program/merqury-1.3")
CPU times: user 1 µs, sys: 0 ns, total: 1 µs Wall time: 3.1 µs
%%time
vf.tl.calQV(obj, mergury_dir= "/gpfs/gsfs11/users/kimj75/program/merqury-1.3")
CPU times: user 0 ns, sys: 25 µs, total: 25 µs Wall time: 46.3 µs
obj = vf.pp.getQV(obj)
obj
verkkoDir: /data/Phillippy/projects/giraffeT2T/assembly/verkko2.2_hifi-duplex_trio-hic/verkko-thic_legacy/ paths: name, path, assignment gaf: Qname, path, mapQ, identity, path_modi qv: asmName, nKmer_uniq_asm, nKmer_total, QV, ErrorRate
obj.qv
| asmName | nKmer_uniq_asm | nKmer_total | QV | ErrorRate |
|---|---|---|---|---|
Loading ITables v2.2.4 from the init_notebook_mode cell...
(need help?) |
Each column shows:
- Assembly of interest: This is a combination of the two datasets mentioned above.
- K-mers uniquely found only in the assembly
- Total K-mers found in the assembly
- QV (Quality Value)
- Error rate
For more detailed information, please visit the Merqury GitHub.
vf.pl.qvPlot(obj)
/vf/users/Phillippy/projects/giraffeT2T/assembly/script/post_verkko_pkg/src/verkkoFillet/plotting/_baseQC.py:45: UserWarning: set_ticklabels() should only be used with a fixed number of ticks, i.e. after set_ticks() or using a FixedLocator. ax1.set_xticklabels(ax1.get_xticklabels(), rotation=45, ha='right')
Chromosome assignment¶
To check which chromosomes are contigs and to pair the same chromosomes from two haplotypes, chromosome assignment is crucial. The vf.tl.chrAssign function will run Minimap to align the assembly to a user-provided reference FASTA file and obtain high-confidence chromosome assignments for each contig in the assembly. Once this is done, we will load not only the chromosome assignment results but also the T2T stats generated earlier using the vf.pp.getT2T function.
The chromosome.map file has two columns:
- One column contains the names of contigs associated with each primary chromosome (starting with CN, CM, etc., in the reference FASTA file).
- The other column contains the names of the primary chromosomes (e.g., chr1, chr2, etc.).
These contig names need to be converted into easy-to-read chromosome names.
ref = "/vf/users/Phillippy/projects/giraffeT2T/assembly/ncbiRef/GCA_017591445.1_ASM1759144v1_genomic.fna"
mapfile="/vf/users/Phillippy/projects/giraffeT2T/assembly/verkko2.2_hifi-duplex_trio-hic/verkko-thic_legacy/chromosome.map"
%%time
vf.tl.chrAssign(obj = obj, ref = ref, chr_name="CM")
get Chr name was done! Standard Output: Using custom reference sequence /vf/users/Phillippy/projects/giraffeT2T/assembly/ncbiRef/GCA_017591445.1_ASM1759144v1_genomic.fna 113 sire_ dam_ compNC: . regNC: CM
obj = vf.pp.readChr(obj,mapfile)
The chromosome infomation was stored in obj.stats
obj
verkkoDir: /data/Phillippy/projects/giraffeT2T/assembly/verkko2.2_hifi-duplex_trio-hic/verkko-thic_legacy/ stats: contig, scf_ctg, ref_chr, contig_len, ref_chr_len, hap, chr, completeness paths: name, path, assignment gaf: Qname, path, mapQ, identity, path_modi qv: asmName, nKmer_uniq_asm, nKmer_total, QV, ErrorRate
obj.stats.head()
| contig | scf_ctg | ref_chr | contig_len | ref_chr_len | hap | chr | completeness |
|---|---|---|---|---|---|---|---|
Loading ITables v2.2.4 from the init_notebook_mode cell...
(need help?) |
tab = obj.stats.groupby(['hap','chr'])['contig'].count().reset_index()
# tab.loc[tab['contig'] > 2,:]
tab
| hap | chr | contig |
|---|---|---|
Loading ITables v2.2.4 from the init_notebook_mode cell...
(need help?) |
vf.pl.completePlot(obj)
# the chromosome ordering should be fixed
vf.pl.contigLenPlot(obj)
vf.pl.contigPlot(obj)
- Brick Red: Contig (complete T2T without gaps)
- Salmon Color: Scaffold (T2T with gaps or tangles)
- Baige : Not a scaffold (missing one of the telomere)
Compleasm¶
# to be added
Lets fillet the verkko assembly!¶
Verkko is a very powerful tool for assembling long-read data to generate a complete diploid assembly, but it still has some limitations, such as generating gaps, tangles, missing telomeres, and more. These issues can occur due to sequencing errors, lower sample quality, or overly complex or homogeneous sequences, which are difficult to assemble. As a result, some manual inspection is needed. Below, we classify the scenarios that can lead to gaps in the assembly and their potential solutions.
1. Missing Telomeres
Broken Contig: This can occur when large tangles or gaps exist in the middle of a contig, causing Rukki to split the contig. This can be resolved by connecting the contigs with a gap in the Rukki path.
Internal Telomere Sequencing: In this case, additional sequences are added after the telomere, preventing the
getT2Tfunction from detecting the telomere and reporting the contig as a scaffold, even if it contains an internal telomere. This issue might arise due to ONT sequencing bias and can be fixed by trimming the sequences before the internal telomere after identifying its start position.Missing Sequence at the End: If you see isolated telomere-attached nodes, you can find homologous nodes to determine the counterpart of the haplotype on the chromosome and stitch them together by inserting a gap.
2. Gaps, Bubbles, and Tangles
rDNA: The number of rDNA arrays and their morphs vary by species, and rDNA sequences are too similar to be assembled automatically. We recommend running
ribotinto find the consensus of the rDNA sequences and patch the assembly with the rDNA consensus.Complex Tangles: Sometimes, Hifiasm can generate longer contigs that cover tangles in the Verkko assembly. We can use the HiFi assembly to align it on the Verkko graph and get hints from the alignment to resolve the walk on the nodes.
Simple Tangles or Bubbles: In this scenario, ONT alignments on the Verkko graph using
graphalignercan be helpful. Searching for supported paths that connect nodes in the gap can help resolve simple tangles.Edge Missing: If your assembly has missing edges between two nodes, this might be due to too few supported ONT reads (<3) connecting the nodes. We can find the supported split ONT reads aligned on nearby nodes and add the edge to the graph.
Loops: Repeated sequences with multiple copies can be problematic to assemble, especially when the repeat is long enough to span ONT reads at both ends of the flanking nodes. In this case, we edit the gaps with the estimated number of loops and fill the gaps with semi-filled gaps.
3. Haplotype Unassignment
One Haplotypes is Ambiguous: This scenario occurs when heterozygous nodes flank long runs of homozygous nodes, making Verkko unsure which haplotype should be assigned. If we have evidence that the other haplotype has already been assigned to one of the nodes in the gap, we can assign the other haplotype to the ambiguous node.
Both Haplotypes are Ambiguous: When both haplotypes have gaps in a bubble, there’s no way to determine the correct haplotype for the gap. In this case, we will randomly assign the haplotypes. This can be corrected during the polishing step later.
image_path = "/vf/users/Phillippy/projects/giraffeT2T/assembly/script/post_verkko_pkg/data/test_giraffe/fig/verkko_curation_decision_tree.png"
display(Image(filename=image_path,width=1500, height = 500))
Missing Telomere¶
3. Internal telomere sequencing detection¶
Sometimes, additional sequences are added to the end of a contig beyond the telomere. As a result, the vf.tl.getT2T function classifies them as not scaffolds because it cannot detect telomere sequences at the end of the contig.
To check if your assembly has this kind of issue, you can use the following functions:
vf.tl.run_intra_telo(): Finds all telomere sequences at the whole-genome level.vf.pp.find_intra_telo(): Helps identify which contigs contain internal telomere sequences.
You can trim these sequences later, after running the final Verkko consensus step.
# vf.tl.run_intra_telo(obj)
file = "/vf/users/Phillippy/projects/giraffeT2T/assembly/verkko2.2_hifi-duplex_trio-hic/verkko-thic_v0.0.4/giraffe/assembly.windows.0.5.bed"
vf.pp.find_intra_telo(obj, file=file, loc_from_end=15000)
| contig | ref_chr | chr | hap | start | end | len_fai |
|---|---|---|---|---|---|---|
Loading ITables v2.2.4 from the init_notebook_mode cell...
(need help?) |
The dam_compressed.k31.hapmer-0000002 contig (chr11 mat) was reported as a scaffold (see the vf.pl.contigPlot(obj) plot) because the start telomere begins inside the telomeric region (18,800 bp from the first base at one end). This caused getT2T to miss it. We will trim this contig after running Verkko-consensus to finalize the assembly and fix the issue.
Gaps / Bubbles / Tangles¶
Finding gaps from path¶
Using the Rukki final paths that are already loaded in the object, we can retrieve the gaps with flanking nodes and the names of the contigs using the vf.pp.findGaps function. This function will store the list of gaps in the obj.gaps attribute, and you can access the table.
%%time
obj = vf.pp.findGaps(obj)
43 gaps were found -> obj.gaps CPU times: user 495 ms, sys: 554 ms, total: 1.05 s Wall time: 1.05 s
print(obj)
verkkoDir: /data/Phillippy/projects/giraffeT2T/assembly/verkko2.2_hifi-duplex_trio-hic/verkko-thic_legacy/ stats: contig, scf_ctg, ref_chr, contig_len, ref_chr_len, hap, chr, completeness paths: name, path, assignment, gaps gaps: gapId, name, gaps, notes, fixedPath, startMatch, endMatch, finalGaf, done gaf: Qname, path, mapQ, identity, path_modi paths_freq: path, size, path_modi qv: asmName, nKmer_uniq_asm, nKmer_intersect, QV, ErrorRate
The first three columns are generated from the Rukki path, and the others are the columns we will fill during gap filling. The gaps column contains each gap along with the flanking nodes, allowing you to easily visualize how the gaps or bubbles look using the flanking nodes in Bandage.
obj.gaps
| gapId | name | gaps | notes | fixedPath | startMatch | endMatch | finalGaf | done | |
|---|---|---|---|---|---|---|---|---|---|
Loading ITables v2.2.4 from the init_notebook_mode cell...
(need help?) |
Align ONT Reads onto the graph¶
All ONT datasets used during the Verkko assembly are stored in the Verkko output directory. We use the ONT reads and align them to the Verkko graph using graphAligner. This alignment helps us create reasonable paths through the gaps and bubbles. In future versions of Verkko, this step will be integrated into the assembly process.
vf.tl.graphIdx() will index the assembly.homopolymer-compressed.gfa graph, and vf.tl.graphAlign() uses the ONT reads under the 3-align/split directory. The output GAF file will be saved to the 9-graphAlignment/verkko.graphAlign_allONT.gaf.
vf.tl.graphIdx(obj,threads=1)
vf.tl.graphAlign(obj = obj, threads=50)
obj = vf.pp.readGaf(obj)
Looking for GAF file at: /vf/users/Phillippy/projects/giraffeT2T/assembly/verkko2.2_hifi-duplex_trio-hic/verkko-thic_legacy/9-graphAlignment/verkko.graphAlign_allONT.gaf Loading ONT alignment GAF file... GAF file successfully loaded.
obj
verkkoDir: /data/Phillippy/projects/giraffeT2T/assembly/verkko2.2_hifi-duplex_trio-hic/verkko-thic_legacy/ stats: contig, scf_ctg, ref_chr, contig_len, ref_chr_len, hap, chr, completeness paths: name, path, assignment, gaps gaps: gapId, name, gaps, notes, fixedPath, startMatch, endMatch, finalGaf, done gaf: Qname, path, mapQ, identity, path_modi paths_freq: path, size, path_modi qv: asmName, nKmer_uniq_asm, nKmer_intersect, QV, ErrorRate
obj.gaf.head(2)
| Qname | len | path | mapQ | identity | path_modi |
|---|---|---|---|---|---|
Loading ITables v2.2.4 from the init_notebook_mode cell...
(need help?) |
Searching the nodes in the ONT alingment GAF file.¶
Once you have loaded the ONT alignment into your Verkko Fillet object, you can search using two nodes to view the number of paths that span both nodes using the vf.pp.searchNodes function. The output will have 4 columns:
- path: The ONT alignment path.
- size: The number of supported reads for the path.
- node1:
Yif the path spans node1. - node2:
Yif the path spans node2.
%%time
node_list_input = ['utig4-317', 'utig4-2181']
vf.pp.searchNodes(obj, node_list_input)
`obj.paths_freq` already exists. Extracting paths containing nodes: ['utig4-317', 'utig4-2181'] CPU times: user 25.9 ms, sys: 1.99 ms, total: 27.9 ms Wall time: 27.8 ms
| path | size | @utig4-317@ | @utig4-2181@ | |
|---|---|---|---|---|
| 14894 | >utig4-317<utig4-2324<utig4-2181 | 2 | Y | Y |
| 14892 | >utig4-317 | 13920 | Y | |
| 6472 | <utig4-317 | 13589 | Y | |
| 3796 | <utig4-2181 | 779 | Y | |
| 12405 | >utig4-2181 | 658 | Y | |
| 6473 | <utig4-317>utig4-316 | 93 | Y | |
| 3797 | <utig4-2181>utig4-2180 | 90 | Y | |
| 4223 | <utig4-2324<utig4-2181 | 88 | Y | |
| 6470 | <utig4-316>utig4-317 | 88 | Y | |
| 3794 | <utig4-2180>utig4-2181 | 87 | Y | |
| 12406 | >utig4-2181>utig4-2324 | 86 | Y | |
| 12849 | >utig4-2324<utig4-317 | 85 | Y | |
| 14893 | >utig4-317<utig4-2324 | 78 | Y | |
| 3798 | <utig4-2181>utig4-2180<utig4-1681 | 1 | Y | |
| 6474 | <utig4-317>utig4-316>utig4-319 | 1 | Y | |
| 6475 | <utig4-317>utig4-316>utig4-320 | 1 | Y | |
| 6480 | <utig4-319<utig4-316>utig4-317 | 1 | Y |
3. Simple Tangles or Bubbles¶
image_path = "/vf/users/Phillippy/projects/giraffeT2T/assembly/script/post_verkko_pkg/data/test_giraffe/fig/unit18.png"
display(Image(filename=image_path,width=500, height = 500))
%%time
node_list_input = ['utig4-424', 'utig4-1283']
vf.pp.searchNodes(obj, node_list_input)
`obj.paths_freq` already exists. Extracting paths containing nodes: ['utig4-424', 'utig4-1283'] CPU times: user 10.4 ms, sys: 260 µs, total: 10.7 ms Wall time: 9.88 ms
| path | size | @utig4-424@ | @utig4-1283@ | |
|---|---|---|---|---|
| 829 | <utig4-1283<utig4-1281<utig4-424<utig4-421 | 1 | Y | Y |
| 827 | <utig4-1283 | 49416 | Y | |
| 9406 | >utig4-1283 | 49028 | Y | |
| 4658 | <utig4-2456<utig4-1283 | 89 | Y | |
| 828 | <utig4-1283<utig4-1281 | 81 | Y | |
| 9407 | >utig4-1283>utig4-2456 | 76 | Y | |
| 9402 | >utig4-1281>utig4-1283 | 70 | Y | |
| 15164 | >utig4-421>utig4-424>utig4-1281 | 66 | Y | |
| 822 | <utig4-1281<utig4-424<utig4-421 | 62 | Y | |
| 6763 | <utig4-424<utig4-421 | 7 | Y | |
| 15163 | >utig4-421>utig4-424 | 2 | Y | |
| 821 | <utig4-1281<utig4-424 | 1 | Y | |
| 11578 | >utig4-1901<utig4-2456<utig4-1283 | 1 | Y | |
| 15173 | >utig4-424>utig4-1281 | 1 | Y |
vf.pp.highlight_nodes(obj, node = "utig4-1282+")
| name | path | assignment | gaps |
|---|---|---|---|
| dam_compressed.k31.hapmer_from_utig4-1282 | utig4-1774-,utig4-1771-,utig4-1773+,utig4-2684+,utig4-2600-,utig4-2596-,utig4-2597+,utig4-2636-,utig4-1227-,utig4-1225-,utig4-878-,utig4-875-,utig4-877+,utig4-2373+,utig4-2374+,utig4-2452-,utig4-692-,utig4-688+,utig4-693+,utig4-1779+,utig4-1781+,utig4-2402+,utig4-1978-,utig4-1974-,utig4-1976+,utig4-2502+,utig4-2018-,utig4-2017+,utig4-2020+,utig4-2676-,utig4-223-,utig4-221+,utig4-225+,utig4-1809+,utig4-1810+,utig4-2721+,utig4-1790-,utig4-1786-,utig4-1787+,utig4-1993-,utig4-1995+,utig4-2426+,utig4-2427+,utig4-2575-,utig4-721-,utig4-720-,utig4-308-,utig4-306+,utig4-309+,utig4-983+,utig4-984+,utig4-2448+,utig4-415-,utig4-413-,utig4-398-,utig4-394-,utig4-395+,utig4-2602-,utig4-1857-,utig4-1855+,utig4-949-,utig4-946-,utig4-947+,utig4-2268+,utig4-2269+,utig4-2642+,utig4-515-,utig4-512-,utig4-513+,utig4-2161+,utig4-1793-,utig4-1791+,utig4-1470-,utig4-1469-,utig4-1251-,utig4-1250+,utig4-1253+,utig4-2573-,utig4-199-,utig4-197+,utig4-200+,utig4-2000+,utig4-2002+,utig4-2403+,utig4-1642-,utig4-1592-,utig4-1589+,utig4-1593+,utig4-1818-,utig4-1819+,utig4-2388+,utig4-2390+,utig4-2490-,utig4-2386-,utig4-2384+,utig4-2257-,utig4-2255-,utig4-474-,utig4-472+,utig4-475+,utig4-2363+,utig4-2327-,utig4-2325+,utig4-2328+,utig4-2650-,utig4-422-,utig4-421+,utig4-425+,utig4-1281+,utig4-1282+,utig4-2456+,utig4-1902-,utig4-1898-,utig4-1899+,utig4-2106-,utig4-2108+,utig4-2305+,utig4-2192-,utig4-2189-,utig4-2190+,utig4-2238+,utig4-1077-,utig4-1074-,utig4-1076+,utig4-2166-,utig4-1101-,utig4-1098-,utig4-1100+,utig4-2751-,utig4-2752+ | DAM_COMPRESSED.K31.HAPMER | None |
In this scenario, we find one supporting ONT alignment spanning the gap and flanking reads. However, the number of supporting reads is too small, so we also check the other haplotype paths by using the vf.pp.highlight_nodes() function with the counterpart nodes on the other haplotype to see if another haplotype uses the other nodes in the bubble. If so, as shown here, we can assign the node utig-424 to the paternal strand to fill the gap.
If you have enough evidence for filling the gaps (or partially filling them), you can use this information to fill the gaps using the vf.pp.fillGaps function. You can easily extract the order and direction of each node using Bandage tools (Output > Specify exact path for copy/save). This information will be stored in the obj.gaps attribute and will be used for generating the edited Rukki path for the Verkko CNS run.
- Caution: Be careful to maintain the original node direction when filling the gaps.
%%time
obj = vf.pp.fillGaps(obj=obj,
gapId='gapid_19',
final_path="utig4-424+, utig4-1281+, utig4-1283+")
final path : >utig4-424>utig4-1281>utig4-1283 Updated gapId gapid_19! ✅ The start node and its direction match the original node. ✅ The start node and its direction match the original node. [= ] 1/43 gaps filled CPU times: user 3.05 ms, sys: 0 ns, total: 3.05 ms Wall time: 3 ms
Once you edit your path, it would be great to check the obj.gaps to ensure that both the startMatch and the endMatch are set to "match." The vf.pp.fillGaps() function also provides this information when you edit the path. These two pieces of information not only indicate the starting and ending nodes but also account for the direction of the nodes. If you provide reverted nodes, each field will be filled with "notMatch." If this is unintentional, it should be corrected before completing the gap-filling process.
Here, we show how the vf.pp.fillGaps() function reports when we provide different directions for the end nodes.
%%time
obj = vf.pp.fillGaps(obj=obj,
gapId='gapid_19',
final_path="utig4-424+, utig4-1281+, utig4-1283-")
final path : >utig4-424>utig4-1281<utig4-1283 Updated gapId gapid_19! ✅ The start node and its direction match the original node. ❌ The start node and its direction do not match the original node. [= ] 1/43 gaps filled CPU times: user 2.81 ms, sys: 0 ns, total: 2.81 ms Wall time: 2.77 ms
It says that the start node and its direction match, but the end node does not match.
If you want to modify the gap information you have already added, simply use the same function to overwrite it. Additionally, if you want to reset the gap information, set final_path="" in vf.pp.fillGaps.
%%time
obj = vf.pp.fillGaps(obj=obj,
gapId='gapid_19',
final_path="")
gapId gapid_19: 'final_path' is empty. Other columns have been reset to 'NA'. [ ] 0/43 gaps filled CPU times: user 2.65 ms, sys: 0 ns, total: 2.65 ms Wall time: 2.59 ms
4. Edge Missing¶
image_path = "/vf/users/Phillippy/projects/giraffeT2T/assembly/script/post_verkko_pkg/data/test_giraffe/fig/unit17.png"
display(Image(filename=image_path,width=500, height = 500))
The vf.pp.searchSplit function will find the split reads that are aligned to two nodes that are not connected.
%%time
node_list_input = ["utig4-2329","utig4-2651"]
vf.pp.searchSplit(obj,node_list_input)
CPU times: user 7.19 s, sys: 66.8 ms, total: 7.25 s Wall time: 7.23 s
| Qname | path_modi | |
|---|---|---|
Loading ITables v2.2.4 from the init_notebook_mode cell...
(need help?) |
split_reads = vf.pp.searchSplit(obj,node_list_input)
vf.tl.insertGap(obj, gapid = "gapid_18", split_reads = split_reads)
Extracting reads...
The split reads for gapid_18 was saved to 11.cleaning/missing_edge/gapid_18.missing_edge.ont_list.txt
The fill the edge was done for gapid_18!
The final path looks like:
{'>utig4-2651<gapmanual-1-len--4567-cov-4<utig4-2329', '>utig4-2329>gapmanual-1-len--4567-cov-4<utig4-2651'}
Troubleshooting : only two reads should be included in the subset.gaf. if you give more than two alignment from one reads, it gives you empty patch.gaf.
%%time
obj = vf.pp.fillGaps(obj=obj,
gapId='gapid_18',
final_path="utig4-2329+,gapmanual-1-len--4492-cov-2+,utig4-2651-")
final path : >utig4-2329>gapmanual-1-len--4492-cov-2<utig4-2651 Updated gapId gapid_18! ✅ The start node and its direction match the original node. ✅ The start node and its direction match the original node. [===== ] 5/43 gaps filled CPU times: user 9.37 ms, sys: 0 ns, total: 9.37 ms Wall time: 9.28 ms
4. Loops¶
image_path = "/vf/users/Phillippy/projects/giraffeT2T/assembly/script/post_verkko_pkg/data/test_giraffe/fig/unit2.png"
display(Image(filename=image_path,width=500, height = 500))
This is one of the rDNA clusters in the Giraffe genome. The node utig4-2421 has a loop, and we don't know how many times the node exists in each haplotype strand. Here, we estimate the minimum number of loops for the node on the maternal strand by finding the number of loop nodes in each ONT alignment that spans the haplotype-assigned nodes near the loop node(in this case, utig4-100 and utig4-495). We then fill the gaps with the estimated number of loops, along with the flanking new gaps. This allows us to build an assembly that may not be perfect but is gradually improving. Additionally, multiple loops could enhance the read alignment in future use.
%%time
vf.pp.estLoops(obj, ["utig4-100", "utig4-495"])
CPU times: user 488 ms, sys: 1.33 s, total: 1.81 s Wall time: 7.42 s
The most frequent number of loops in the ONT path spanning the loop node and the haplotype-assigned nodes is 3 for this gap. So, we fill this gap with 3 copies of the loop node, flanked by new gaps named loop_uncertain_copy. Additionally, for this case, we add a gap at the end of the new path, which could lead to an "unmatch" warning in the report. However, this is intentionally done, so we can ignore the warning here.
obj = vf.pp.fillGaps(obj=obj,
gapId='gapid_38',
final_path="utig4-100+, [N5000N:loop_uncertain_copy],utig4-2421+,utig4-2421+,utig4-2421+,[N5000N:loop_uncertain_copy]")
final path : >utig4-100[N5000N:loop_uncertain_copy]>utig4-2421>utig4-2421>utig4-2421[N5000N:loop_uncertain_copy] Updated gapId gapid_38! ✅ The start node and its direction match the original node. ❌ The start node and its direction do not match the original node. [== ] 2/43 gaps filled
vf.pp.highlight_nodes(obj, node = "utig4-495")
| name | path | assignment | gaps |
|---|---|---|---|
| dam_compressed.k31.hapmer_from_utig4-1104 | utig4-1107-,utig4-1103-,utig4-1104+,utig4-1511-,utig4-1512+,utig4-1562-,utig4-1561+,utig4-1564+,utig4-1967+,utig4-817-,utig4-813-,utig4-815+,utig4-2301+,utig4-2302+,utig4-2391+,utig4-1294-,utig4-1292+,utig4-1296+,utig4-1454+,utig4-1455+,utig4-2461-,utig4-2463+,utig4-2467+,utig4-2526+,utig4-565-,utig4-564+,utig4-568+,utig4-1057-,utig4-662-,utig4-659-,utig4-661+,utig4-1606+,utig4-1608+,utig4-1609+,utig4-2565+,utig4-2412-,utig4-2410-,utig4-1817-,utig4-1815+,utig4-67-,utig4-66+,utig4-70+,[N212780N:tangle],utig4-100+,[N5000N:ambig_path],utig4-2421+,utig4-495-,utig4-494+,utig4-497+,utig4-1094-,utig4-1096+,utig4-2419-,utig4-2214-,utig4-2212+,utig4-2216+,utig4-2217+,utig4-450-,utig4-449+,utig4-452+,utig4-2290-,utig4-1272-,utig4-1271+,utig4-1274+,utig4-2229-,utig4-1878-,utig4-1877+,utig4-1881+,utig4-2480+,utig4-2335-,utig4-2332-,utig4-2334+,utig4-2477-,utig4-2478+,utig4-1431-,utig4-1429-,utig4-1427+,utig4-1064-,utig4-1063+,utig4-1066+,utig4-1782-,utig4-1784+,utig4-2662+,utig4-2594-,utig4-2592+,utig4-1132-,utig4-1130-,[N5000N:ambig_path],utig4-1082+,utig4-1085+,utig4-2082-,utig4-2084+,utig4-2086+,utig4-2168+,utig4-1949-,utig4-1948+,utig4-1813-,utig4-1812-,utig4-1551-,utig4-1549-,utig4-1268-,utig4-1266+,utig4-1270+,utig4-2441+,utig4-645-,utig4-641-,utig4-642+,utig4-1408+,utig4-1410+,utig4-1924-,utig4-1925+,utig4-2246-,utig4-686-,utig4-683-,utig4-685+,utig4-2099-,utig4-2101+,utig4-2618-,utig4-2245-,utig4-2243-,utig4-1120-,utig4-1116-,utig4-1117+,utig4-1453+,utig4-968-,utig4-967-,utig4-596-,utig4-592-,utig4-593+,utig4-1041+,utig4-368-,utig4-367+,utig4-370+ | DAM_COMPRESSED.K31.HAPMER | None |
6.Homologous nodes but hint from other haplotype¶
image_path = "/vf/users/Phillippy/projects/giraffeT2T/assembly/script/post_verkko_pkg/data/test_giraffe/fig/gapid_24.png"
display(Image(filename=image_path,width=500, height = 500))
%%time
node_list_input = ['utig4-282', 'utig4-2090']
vf.pp.searchNodes(obj, node_list_input)
`obj.paths_freq` already exists. Extracting paths containing nodes: ['utig4-282', 'utig4-2090'] CPU times: user 7.98 ms, sys: 1 ms, total: 8.98 ms Wall time: 8.33 ms
| path | size | @utig4-282@ | @utig4-2090@ | |
|---|---|---|---|---|
| 6386 | <utig4-282>utig4-283>utig4-2090 | 90 | Y | Y |
| 3498 | <utig4-2090<utig4-283>utig4-282 | 88 | Y | Y |
| 3500 | <utig4-2090<utig4-284>utig4-282 | 85 | Y | Y |
| 6388 | <utig4-282>utig4-284>utig4-2090 | 82 | Y | Y |
| 3496 | <utig4-2090 | 1792 | Y | |
| 12124 | >utig4-2090 | 1773 | Y | |
| 14794 | >utig4-282 | 559 | Y | |
| 6384 | <utig4-282 | 543 | Y | |
| 6392 | <utig4-285<utig4-282 | 101 | Y | |
| 6394 | <utig4-286<utig4-282 | 94 | Y | |
| 14796 | >utig4-282>utig4-286 | 92 | Y | |
| 7877 | <utig4-808>utig4-809<utig4-2090 | 90 | Y | |
| 14795 | >utig4-282>utig4-285 | 86 | Y | |
| 12126 | >utig4-2090<utig4-809>utig4-808 | 76 | Y | |
| 7879 | <utig4-808>utig4-810<utig4-2090 | 74 | Y | |
| 12128 | >utig4-2090<utig4-810>utig4-808 | 72 | Y | |
| 3497 | <utig4-2090<utig4-283 | 9 | Y | |
| 16352 | >utig4-809<utig4-2090 | 9 | Y | |
| 6389 | <utig4-283>utig4-282 | 8 | Y | |
| 12125 | >utig4-2090<utig4-809 | 6 | Y | |
| 14797 | >utig4-283>utig4-2090 | 6 | Y | |
| 14798 | >utig4-284>utig4-2090 | 6 | Y | |
| 6385 | <utig4-282>utig4-283 | 5 | Y | |
| 12127 | >utig4-2090<utig4-810 | 5 | Y | |
| 16390 | >utig4-810<utig4-2090 | 5 | Y | |
| 6387 | <utig4-282>utig4-284 | 3 | Y | |
| 6390 | <utig4-284>utig4-282 | 3 | Y | |
| 3499 | <utig4-2090<utig4-284 | 1 | Y |
Here, we can identify two clear paths: <utig4-282>utig4-283>utig4-2090 and <utig4-282>utig4-284>utig4-2090. However, we are unable to determine the haplotype using ONT alignment. In such cases, one of the obvious paths might already be used by another haplotype. Thus, we can assign the haplotype harboring this gap to the other obvious path, using an elimination method.
vf.pp.highlight_nodes(obj, node = "utig4-282-")
| name | path | assignment | gaps |
|---|---|---|---|
| sire_compressed.k31.hapmer_from_utig4-457 | utig4-2548+,utig4-2549+,utig4-2661-,utig4-916-,utig4-913-,utig4-915+,utig4-1121-,utig4-1122+,utig4-2641-,utig4-1380-,utig4-1379+,utig4-1382+,utig4-2677-,utig4-2585-,utig4-2584-,utig4-2483-,utig4-2482-,utig4-2026-,utig4-2023-,utig4-2025+,utig4-2601+,utig4-285-,utig4-282-,[N5000N:ambig_path],utig4-2090+,utig4-809-,utig4-808+,utig4-812+,utig4-2691+,utig4-355-,utig4-353+,utig4-357+,utig4-1350-,utig4-1351+,[N5000N:ambig_path],utig4-6+,utig4-14+,utig4-1317+,utig4-1320+,utig4-2617+,utig4-867-,utig4-865-,utig4-866+,utig4-869+,utig4-1097+,utig4-840-,utig4-839+,utig4-53-,utig4-51+,utig4-54+,utig4-1979-,utig4-456-,utig4-454+,utig4-457+,utig4-1300-,utig4-1302+,utig4-2170+,utig4-1373-,utig4-1371+,utig4-1375+,utig4-2262-,utig4-1946-,utig4-1943-,utig4-1945+,utig4-2512-,utig4-2514+,utig4-40-,[N5000N:ambig_path],utig4-39+,[N1000N:ambig_bubble],utig4-119-,utig4-120+,utig4-1010-,utig4-1012+,utig4-1263-,utig4-1265+,utig4-2622+,utig4-2473-,utig4-2471+,utig4-1448-,utig4-1447+,utig4-1451+,utig4-2186-,utig4-2187+,utig4-1969-,utig4-1968-,utig4-470-,utig4-467-,utig4-469+,utig4-887+,utig4-888+,utig4-2438-,utig4-2440+,utig4-2469+,utig4-2199-,utig4-2198+,utig4-2202+,utig4-2251-,utig4-1669-,utig4-1667+,utig4-1671+,utig4-1868+,utig4-803-,utig4-800-,utig4-802+,utig4-885+,utig4-511-,utig4-509+,utig4-765-,utig4-766+,utig4-2122-,utig4-2124+,utig4-2366+,utig4-352-,utig4-348-,utig4-350+,utig4-2621+,utig4-1188-,utig4-1186+,utig4-1190+,utig4-1191+,utig4-2078-,utig4-2077-,utig4-1141-,utig4-1139+,utig4-1092-,utig4-1089-,utig4-1091+,utig4-1389+,utig4-1391+,utig4-2008-,utig4-28-,utig4-27+,utig4-31+,utig4-823+,utig4-825+ | SIRE_COMPRESSED.K31.HAPMER | None |
| dam_compressed.k31.hapmer_from_utig4-458 | utig4-2548+,utig4-2550+,utig4-2661-,utig4-917-,utig4-913-,utig4-914+,utig4-1121-,utig4-1123+,utig4-2641-,utig4-1381-,utig4-1379+,utig4-1383+,utig4-2677-,utig4-2586-,utig4-2584-,utig4-2484-,utig4-2482-,utig4-2027-,utig4-2023-,utig4-2024+,utig4-2601+,utig4-286-,utig4-282-,utig4-283+,utig4-2090+,utig4-810-,utig4-808+,utig4-811+,utig4-2691+,utig4-354-,utig4-353+,utig4-356+,utig4-1350-,utig4-1318-,utig4-1317+,utig4-1319+,utig4-2617+,utig4-868-,utig4-865-,utig4-289-,utig4-287+,utig4-1097+,utig4-841-,utig4-839+,utig4-52-,utig4-51+,utig4-55+,utig4-1979-,utig4-455-,utig4-454+,utig4-458+,utig4-1300-,utig4-1301+,utig4-2170+,utig4-1372-,utig4-1371+,utig4-1374+,utig4-2262-,utig4-1947-,utig4-1943-,utig4-1944+,utig4-2512-,utig4-2513+,utig4-2515+,utig4-2612-,utig4-2610-,utig4-2608+,[N5000N:ambig_path],utig4-2527-,utig4-1013-,utig4-1010-,utig4-1011+,utig4-1263-,utig4-1264+,utig4-2622+,utig4-2472-,utig4-2471+,utig4-1449-,utig4-1447+,utig4-1450+,utig4-2186-,utig4-1972-,utig4-1971-,utig4-1968-,utig4-471-,utig4-467-,utig4-468+,utig4-887+,utig4-889+,utig4-2438-,utig4-2439+,utig4-2469+,utig4-2199-,utig4-2198+,utig4-2203+,utig4-2251-,utig4-1668-,utig4-1667+,utig4-1670+,utig4-1868+,utig4-804-,utig4-800-,utig4-801+,utig4-885+,utig4-886+,utig4-768-,utig4-765-,utig4-767+,utig4-2122-,utig4-2123+,utig4-2366+,utig4-351-,utig4-348-,utig4-349+,utig4-2621+,utig4-1187-,utig4-1186+,utig4-1189+,utig4-2077-,utig4-1140-,utig4-1139+,utig4-1093-,utig4-1089-,utig4-1090+,utig4-1389+,utig4-1390+,utig4-2008-,utig4-29-,utig4-27+,utig4-30+,utig4-823+,utig4-824+ | DAM_COMPRESSED.K31.HAPMER | None |
When we search for the utig4-282- node in the paths, the maternal (dam) strand has already used the utig4-283+ node. Therefore, we can use the utig4-284+ node to fill the gap on the paternal (sire) strand.
obj = vf.pp.fillGaps(obj=obj,
gapId='gapid_24',
final_path="utig4-282-, utig4-284+, utig4-2090+")
final path : <utig4-282>utig4-284>utig4-2090 Updated gapId gapid_24! ✅ The start node and its direction match the original node. ✅ The start node and its direction match the original node. [=== ] 3/43 gaps filled
7. Homologous without hint¶
image_path = "/vf/users/Phillippy/projects/giraffeT2T/assembly/script/post_verkko_pkg/data/test_giraffe/fig/gapid_10.png"
display(Image(filename=image_path,width=500, height = 500))
%%time
node_list_input = ['utig4-2613', 'utig4-626']
vf.pp.searchNodes(obj, node_list_input)
`obj.paths_freq` already exists. Extracting paths containing nodes: ['utig4-2613', 'utig4-626'] CPU times: user 24.8 ms, sys: 1e+03 µs, total: 25.8 ms Wall time: 25.5 ms
| path | size | @utig4-2613@ | @utig4-626@ | |
|---|---|---|---|---|
| 15775 | >utig4-626>utig4-629<utig4-2613 | 108 | Y | Y |
| 13876 | >utig4-2613<utig4-629<utig4-626 | 101 | Y | Y |
| 13878 | >utig4-2613<utig4-630<utig4-626 | 57 | Y | Y |
| 15777 | >utig4-626>utig4-630<utig4-2613 | 53 | Y | Y |
| 7300 | <utig4-626 | 5187 | Y | |
| 15773 | >utig4-626 | 5173 | Y | |
| 13874 | >utig4-2613 | 2813 | Y | |
| 5199 | <utig4-2613 | 2752 | Y | |
| 15728 | >utig4-612>utig4-616>utig4-2613 | 112 | Y | |
| 5201 | <utig4-2613<utig4-615<utig4-612 | 86 | Y | |
| 5203 | <utig4-2613<utig4-616<utig4-612 | 83 | Y | |
| 15726 | >utig4-612>utig4-615>utig4-2613 | 76 | Y | |
| 7305 | <utig4-626>utig4-628<utig4-2521 | 68 | Y | |
| 7302 | <utig4-626>utig4-627<utig4-2521 | 62 | Y | |
| 13536 | >utig4-2521<utig4-628>utig4-626 | 54 | Y | |
| 13534 | >utig4-2521<utig4-627>utig4-626 | 49 | Y | |
| 7311 | <utig4-628>utig4-626 | 38 | Y | |
| 7309 | <utig4-627>utig4-626 | 31 | Y | |
| 7301 | <utig4-626>utig4-627 | 30 | Y | |
| 15776 | >utig4-626>utig4-630 | 30 | Y | |
| 7304 | <utig4-626>utig4-628 | 26 | Y | |
| 7315 | <utig4-630<utig4-626 | 26 | Y | |
| 7306 | <utig4-626>utig4-628<utig4-2521<utig4-532 | 23 | Y | |
| 7303 | <utig4-626>utig4-627<utig4-2521<utig4-532 | 20 | Y | |
| 15790 | >utig4-630<utig4-2613 | 20 | Y | |
| 13877 | >utig4-2613<utig4-630 | 14 | Y | |
| 7312 | <utig4-629<utig4-626 | 6 | Y | |
| 5202 | <utig4-2613<utig4-616 | 5 | Y | |
| 13875 | >utig4-2613<utig4-629 | 5 | Y | |
| 15731 | >utig4-615>utig4-2613 | 5 | Y | |
| 15788 | >utig4-629<utig4-2613 | 5 | Y | |
| 15732 | >utig4-616>utig4-2613 | 4 | Y | |
| 5200 | <utig4-2613<utig4-615 | 3 | Y | |
| 15513 | >utig4-532>utig4-2521<utig4-627>utig4-626 | 2 | Y | |
| 15774 | >utig4-626>utig4-629 | 2 | Y | |
| 7307 | <utig4-626>utig4-628<utig4-2521<utig4-534 | 1 | Y | |
| 14849 | >utig4-302>utig4-532>utig4-2521<utig4-628>utig4-626 | 1 | Y | |
| 14856 | >utig4-303>utig4-534>utig4-2521<utig4-627>utig4-626 | 1 | Y | |
| 15518 | >utig4-534>utig4-2521<utig4-628>utig4-626 | 1 | Y |
vf.pp.highlight_nodes(obj, node = "utig4-2613+")
| name | path | assignment | gaps |
|---|---|---|---|
| dam_compressed.k31.hapmer_from_utig4-1425 | utig4-857-,utig4-854-,utig4-851-,utig4-853+,utig4-2710+,utig4-2454-,utig4-2453-,utig4-59-,utig4-58+,utig4-62+,utig4-2589-,utig4-1175-,utig4-1173-,utig4-613-,utig4-612+,utig4-616+,utig4-2613+,[N1000N:ambig_bubble],utig4-626-,utig4-627+,utig4-2521-,utig4-534-,utig4-303-,utig4-301+,utig4-305+,utig4-669-,utig4-671+,utig4-928-,utig4-929+,utig4-2675+,utig4-1922-,utig4-1920+,utig4-73-,utig4-71+,utig4-74+,utig4-1424-,utig4-1425+,utig4-1845-,utig4-1847+,utig4-2685-,utig4-1587-,utig4-1584-,utig4-1585+,utig4-2457-,utig4-2050-,utig4-2049+,utig4-2052+,utig4-2291-,utig4-1197-,utig4-1193-,utig4-1195+,utig4-2194+,utig4-1865-,utig4-1863+,utig4-1867+,utig4-2382-,utig4-589-,utig4-587+,utig4-591+,utig4-1247+,utig4-1248+,utig4-2016+,utig4-1357-,utig4-1355+,utig4-1358+,utig4-2383-,utig4-2241-,utig4-2239-,utig4-1420-,utig4-1416-,utig4-1417+,utig4-1762+,utig4-1764+,utig4-2554-,utig4-2553-,utig4-2551+,utig4-1767-,utig4-1765+,utig4-1769+,utig4-2487-,utig4-963-,utig4-962+,utig4-966+,utig4-2357-,utig4-902-,utig4-898-,utig4-899+,utig4-1675-,utig4-1677+,utig4-2225+,utig4-1653-,utig4-1651+,utig4-111-,utig4-112+,utig4-2250+,utig4-1259-,utig4-1257+,utig4-1260+,utig4-2540-,utig4-2160-,utig4-2158+,utig4-2092-,utig4-2091+,utig4-2094+,utig4-2494-,utig4-2010-,utig4-2009-,utig4-1723-,utig4-1725+,utig4-1727+,utig4-2420+,utig4-1639-,utig4-1636-,utig4-1637+,utig4-2620-,utig4-1953-,utig4-1951+,utig4-1954+,utig4-2188+,utig4-1402-,utig4-1400- | DAM_COMPRESSED.K31.HAPMER | None |
| sire_compressed.k31.hapmer_from_utig4-1426 | utig4-996-,[N34856N:tangle],utig4-856-,utig4-854-,utig4-851-,utig4-852+,utig4-2710+,utig4-2455-,utig4-2453-,utig4-60-,utig4-58+,utig4-61+,utig4-2589-,utig4-1174-,utig4-1173-,utig4-614-,utig4-612+,utig4-615+,utig4-2613+,[N1000N:ambig_bubble],utig4-626-,utig4-628+,utig4-2521-,utig4-532-,utig4-302-,utig4-301+,utig4-304+,utig4-669-,utig4-670+,utig4-928-,utig4-930+,utig4-2675+,utig4-1921-,utig4-1920+,utig4-72-,utig4-71+,utig4-75+,utig4-1424-,utig4-1426+,utig4-1845-,utig4-1846+,utig4-2685-,utig4-1588-,utig4-1584-,utig4-1586+,utig4-2457-,utig4-2051-,utig4-2049+,utig4-2053+,utig4-2291-,utig4-1196-,utig4-1193-,utig4-1194+,utig4-2194+,utig4-1864-,utig4-1863+,utig4-1866+,utig4-2382-,utig4-588-,utig4-587+,utig4-590+,utig4-1247+,utig4-1249+,utig4-2016+,utig4-1356-,utig4-1355+,utig4-1359+,utig4-2383-,utig4-2240-,utig4-2239-,utig4-1419-,utig4-1416-,utig4-1418+,[N5000N:ambig_path],utig4-1763+,utig4-2554-,utig4-2552-,utig4-2551+,utig4-1766-,utig4-1765+,utig4-1768+,utig4-2487-,utig4-964-,utig4-962+,utig4-965+,utig4-2357-,utig4-901-,utig4-898-,utig4-900+,utig4-1675-,utig4-1676+,utig4-2225+,utig4-1652-,utig4-1651+,utig4-1127-,utig4-1128+,utig4-2250+,utig4-1258-,utig4-1257+,utig4-1261+,utig4-2540-,utig4-2159-,utig4-2158+,utig4-2093-,utig4-2091+,utig4-2095+,utig4-2494-,utig4-2011-,utig4-2009-,utig4-1728-,utig4-1725+,utig4-1726+,[N5000N:ambig_path],utig4-1640-,utig4-1636-,utig4-1638+,utig4-2620-,utig4-1952-,utig4-1951+,utig4-1955+,utig4-2188+,utig4-1401-,utig4-1400- | SIRE_COMPRESSED.K31.HAPMER | None |
This scenario has the same haplotype unassignment issue as above, but there is no hint from the other haplotype. Therefore, we randomly assign the haplotype to each gap using the paths obtained from vf.pp.searchNodes(obj, node_list_input). This haplotype-switching error can be corrected during the polishing step later.
obj = vf.pp.fillGaps(obj=obj,
gapId='gapid_10',
final_path="utig4-2613+, utig4-630-, utig4-626-")
final path : >utig4-2613<utig4-630<utig4-626 Updated gapId gapid_10! ✅ The start node and its direction match the original node. ✅ The start node and its direction match the original node. [==== ] 4/43 gaps filled
obj = vf.pp.fillGaps(obj=obj,
gapId='gapid_12',
final_path="utig4-2613+, utig4-629-, utig4-626-")
final path : >utig4-2613<utig4-629<utig4-626 Updated gapId gapid_12! ✅ The start node and its direction match the original node. ✅ The start node and its direction match the original node. [===== ] 5/43 gaps filled
Finalize paths¶
vf.pp.checkGapFilling(obj)
[===== ] 5/43 gaps filled
| gapId | name | gaps | notes | fixedPath | startMatch | endMatch | finalGaf | done | |
|---|---|---|---|---|---|---|---|---|---|
Loading ITables v2.2.4 from the init_notebook_mode cell...
(need help?) |
The final Rukki path file to be used in the Verkko CNS run can be generated using the fixed gap file in the obj and written to the local directory for future use.
vf.pp.writeFixedGaf(obj)
Reading original rukki path from 8-hicPipeline/rukki.paths.gaf Writing fixed rukki path to final_rukki_fixed.paths.gaf
Save fillet obj¶
The Verkko-fillet object can be saved as a Python pickle object using the vf.pp.save_Verkko() function. This object will be written to the Verkko output directory.
%%time
fileName = "verkko-fillet.pkl"
vf.pp.save_Verkko(obj, fileName)
save verkko fllet obj to -> verkko-fillet.pkl CPU times: user 9.31 s, sys: 1.45 s, total: 10.8 s Wall time: 11.6 s
session_info.show()
Click to view session information
----- itables 2.2.3 pandas 2.2.3 session_info 1.0.0 verkkoFillet NA -----
Click to view modules imported as dependencies
PIL 10.0.0 anyio NA arrow 1.2.3 asttokens NA attr 23.1.0 attrs 23.1.0 babel 2.12.1 certifi 2023.07.22 cffi 1.15.1 charset_normalizer 3.2.0 comm 0.1.4 cycler 0.10.0 cython_runtime NA dateutil 2.9.0.post0 debugpy 1.6.7.post1 decorator 5.1.1 defusedxml 0.7.1 exceptiongroup 1.2.2 executing 2.1.0 fastjsonschema NA fqdn NA idna 3.4 importlib_metadata NA importlib_resources NA ipykernel 6.25.1 ipywidgets 8.1.0 isoduration NA jedi 0.19.2 jinja2 3.1.2 json5 NA jsonpointer 2.4 jsonschema 4.19.0 jsonschema_specifications NA jupyter_events 0.7.0 jupyter_server 2.7.2 jupyterlab_server 2.24.0 kaleido 0.2.1 kiwisolver 1.4.4 markupsafe 2.1.3 matplotlib 3.9.3 matplotlib_inline 0.1.7 mpl_toolkits NA natsort 8.4.0 nbformat 5.9.2 numpy 1.26.0 overrides NA packaging 23.1 parso 0.8.4 patsy 0.5.4 pexpect 4.9.0 pickleshare 0.7.5 pkg_resources NA platformdirs 3.10.0 plotly 5.16.1 prometheus_client NA prompt_toolkit 3.0.48 psutil 5.9.5 ptyprocess 0.7.0 pure_eval 0.2.3 pyarrow 17.0.0 pydev_ipython NA pydevconsole NA pydevd 2.9.5 pydevd_file_utils NA pydevd_plugins NA pydevd_tracing NA pygments 2.18.0 pyparsing 3.0.9 pythonjsonlogger NA pytz 2024.1 referencing NA requests 2.31.0 rfc3339_validator 0.1.4 rfc3986_validator 0.1.1 rpds NA scipy 1.11.4 seaborn 0.12.2 send2trash NA six 1.16.0 sniffio 1.3.0 stack_data 0.6.2 statsmodels 0.14.1 tenacity NA tornado 6.3.3 tqdm 4.66.1 traitlets 5.14.3 typing_extensions NA uri_template NA urllib3 2.0.4 wcwidth 0.2.13 webcolors 1.13 websocket 1.6.1 yaml 6.0.1 zipp NA zmq 25.1.1 zoneinfo NA
----- IPython 8.18.1 jupyter_client 8.3.0 jupyter_core 5.3.1 jupyterlab 4.0.5 notebook 7.0.2 ----- Python 3.9.17 | packaged by conda-forge | (main, Aug 10 2023, 07:02:31) [GCC 12.3.0] Linux-4.18.0-425.19.2.el8_7.x86_64-x86_64-with-glibc2.28 ----- Session information updated at 2024-12-11 00:45
Once you save the Verkko-Fillet object locally, you can load all datasets by loading the object itself, without needing to read them individually next time.
%%time
fileName = "verkko-fillet.pkl"
obj = vf.pp.load_Verkko(fileName)
load verkko fllet obj from <- verkko-fillet.pkl CPU times: user 18min 41s, sys: 7.15 s, total: 18min 48s Wall time: 18min 52s
Make new verkko directory for running verkko consensus¶
To run the Verkko consensus steps using the manual gap-filling path file, the first step is to generate the directory structure required for Verkko to start from the consensus step. The vf.pp.mkCNSdir function helps create this structure in the specified location by copying or creating symbolic links from the Verkko output directory.
%%time
new_folder_path = "/data/Phillippy/projects/giraffeT2T/assembly/verkko2.2_hifi-duplex_trio-hic/verkko-cns-test"
vf.pp.mkCNSdir(obj, new_folder_path)
Symbolic links and files created from /data/Phillippy/projects/giraffeT2T/assembly/verkko2.2_hifi-duplex_trio-hic/verkko-thic_legacy/ to /data/Phillippy/projects/giraffeT2T/assembly/verkko2.2_hifi-duplex_trio-hic/verkko-cns-test /data/Phillippy/projects/giraffeT2T/assembly/verkko2.2_hifi-duplex_trio-hic/verkko-cns-test ├── 1-buildGraph ├── 2-processGraph ├── 3-align ├── 3-alignTips ├── 4-processONT ├── 5-untip ├── 6-layoutContigs ├── 6-rukki ├── 7-consensus └── hifi-corrected.fasta.gz CPU times: user 0 ns, sys: 7.88 s, total: 7.88 s Wall time: 19.5 s
Update the new Verkko directory to include ONT support for filling the edge at blunt missing edge gaps, if your assembly has this issue.
%%time
vf.pp.updateCNSdir_missingEdges(obj, new_folder_path)